Utility and Disclosure Risk Metrics and Synthetic Data Case Studies

Published

July 7, 2023

Code
options(scipen = 999)

library(tidyverse)
library(gt)
library(palmerpenguins)
library(urbnthemes)
library(here)

set_urbn_defaults()

create_table <- function(data_df, 
                         rowname_col = NA,
                         fig_num = "",
                         title_text = ""){
  # random_id = random_id(n=10)
  random_id = "urban_table"
  
  basic_table = data_df |> 
    gt(id = random_id, rowname_col = rowname_col) |> 
    tab_options(#table.width = px(760),
      table.align = "left", 
      heading.align = "left",
      # TODO: Discuss with Comms whether border should extend across 
      # whole row at bottom or just across data cells
      table.border.top.style = "hidden",
      table.border.bottom.style = "transparent",
      heading.border.bottom.style = "hidden",
      # Need to set this to transparent so that cells_borders of the cells can display properly
      table_body.border.bottom.style = "transparent",
      table_body.border.top.style = "transparent",
      # column_labels.border.bottom.style = "transparent",
      column_labels.border.bottom.width = px(1),
      column_labels.border.bottom.color = "black",
      # row_group.border.top.style = "hidden",
      # Set font sizes
      heading.title.font.size = px(13),
      heading.subtitle.font.size = px(13),
      column_labels.font.size = px(13),
      table.font.size = px(13),
      source_notes.font.size = px(13),
      footnotes.font.size = px(13),
      # Set row group label and border options
      row_group.font.size = px(13),
      row_group.border.top.style = "transparent",
      row_group.border.bottom.style = "hidden",
      stub.border.style = "dashed",
    ) |> 
    tab_header(
      title = fig_num,# "eyebrow",
      subtitle = title_text) |>  #"Top 10 Banks (by Dollar Volume) for Community Development Lending") |> 
    # Bold title, subtitle, and columns
    tab_style(
      style = cell_text(color = "black", weight = "bold", align = "left"),
      locations = cells_title("subtitle")
    ) |> 
    tab_style(
      style = cell_text(color = "#696969", weight = "normal", align = "left", transform = "uppercase"),
      locations = cells_title("title")
    ) |> 
    tab_style(
      style = cell_text(color = "black", weight = "bold", size = px(13)),
      locations = cells_column_labels(gt::everything())
    ) |> 
    # Italicize row group and column spanner text
    tab_style(
      style = cell_text(color = "black", style = "italic", size  = px(13)),
      locations = gt::cells_row_groups()
    ) |> 
    tab_style(
      style = cell_text(color = "black", style = "italic", size  = px(13)),
      locations = gt::cells_column_spanners()
    ) |> 
    opt_table_font(
      font = list(
        google_font("Lato"),
        default_fonts()
      )
    ) |> 
    # Adjust cell borders for all cells, small grey bottom border, no top border
    tab_style(
      style = list(
        cell_borders(
          sides = c("bottom"),
          color = "#d2d2d2",
          weight = px(1)
        )
      ),
      locations = list(
        cells_body(
          columns =  gt::everything()
          # rows = gt::everything()
        )
      )
    )  |>
    tab_style(
      style = list(
        cell_borders(
          sides = c("top"),
          color = "#d2d2d2",
          weight = px(0)
        )
      ),
      locations = list(
        cells_body(
          columns =  gt::everything()
          # rows = gt::everything()
        )
      )
    )  |>
    # Set missing value defaults
    # (sub_missing() replaces the deprecated fmt_missing() in gt >= 0.6.0)
    sub_missing(columns = gt::everything(), missing_text = "...") |>
    # Set css for all the things we can't finetune exactly in gt, mostly t/r/b/l padding
    opt_css(
      css = str_glue("
      #{random_id} .gt_row {{
        padding: 5px 5px 5px 5px;
      }}
      #{random_id} .gt_sourcenote {{
        padding: 16px 0px 0px 0px;
      }}
      #{random_id} .gt_footnote {{
        padding: 16px 0px 0px 0px;
      }}
      #{random_id} .gt_subtitle {{
        padding: 0px 0px 2px 0px;
      }}
      #{random_id} .gt_col_heading {{
        padding: 10px 5px 10px 5px;
      }}
      #{random_id} .gt_col_headings {{
        padding: 0px 0px 0px 0px;
        border-top-width: 0px;
      }}
      #{random_id} .gt_group_heading {{
        padding: 15px 0px 0px 0px;
      }}
      #{random_id} .gt_stub {{
        border-bottom-width: 1px;
        border-bottom-style: solid;
        border-bottom-color: #d2d2d2;
        border-top-color: black;
        text-align: left;
      }}
      #{random_id} .gt_grand_summary_row {{
        border-bottom-width: 1px;
        border-top-width: 1px;
        border-bottom-style: solid;
        border-bottom-color: #d2d2d2;
        border-top-color: #d2d2d2;
      }}
      #{random_id} .gt_summary_row {{
        border-bottom-width: 1px;
        border-top-width: 1px;
        border-bottom-style: solid;
        border-bottom-color: #d2d2d2;
      }}
      #{random_id} .gt_column_spanner {{
        padding-top: 10px;
        padding-bottom: 10px;
      }}
      ") |> as.character()
    )
  
  return(basic_table)
}


Review

What’s the difference between partially synthetic data and fully synthetic data?

Partially synthetic data contains unaltered and synthesized variables. In partially synthetic data, there remains a one-to-one mapping between confidential records and synthetic records.

Fully synthetic data contains only synthesized variables. Fully synthetic records no longer map directly onto the confidential records, but they remain statistically representative. Since fully synthetic data does not contain any actual observations, it protects against both attribute and identity disclosure.

Note

Sequential synthesis

In a perfect world, we would synthesize data by directly modeling the joint distribution of the variables of interest. Unfortunately, this is computationally infeasible.

Instead, we often decompose a joint distribution into a marginal distribution and a sequence of conditional distributions.
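Written out for \(p\) variables \(X_1, \dots, X_p\) (notation assumed here for illustration, it does not appear elsewhere in these notes), the decomposition is the chain rule of probability:

\[f(x_1, x_2, \dots, x_p) = f(x_1)\, f(x_2 \mid x_1)\, f(x_3 \mid x_1, x_2) \cdots f(x_p \mid x_1, \dots, x_{p-1})\]

A sequential synthesizer then models the first factor as a marginal distribution and each remaining factor as a conditional model given the previously synthesized variables.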

What’s the difference between specific utility and general utility?

Specific Utility measures the similarity of results for a specific analysis (or analyses) of the confidential and public data (e.g., comparing the coefficients in regression models).

General Utility measures the univariate and multivariate distributional similarity between the confidential data and the public data (e.g., sample means, sample variances, and the variance-covariance matrix).

General Utility Metrics

  • As a refresher, general utility metrics measure the distributional similarity (i.e., all statistical properties) between the original and synthetic data.

  • General utility metrics are useful because they can provide a sense of how “fit for use” your synthetic data is for analysis, without having to make assumptions about the kinds of analysis people might use the synthetic data for.

Univariate

  • Categorical variables: frequencies, relative frequencies

  • Numeric variables: means, standard deviations, skewness, kurtosis (i.e. first four moments), percentiles, and number of zero/non-zero values

  • It is also useful to visually compare univariate distributions using histograms or density plots.
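The numeric summaries listed above can be sketched in base R. This is a minimal illustration, not a prescribed workflow: the `conf` and `synth` data frames are simulated stand-ins, and `summarize_numeric()` is a hypothetical helper introduced here.

```r
set.seed(20230707)

# Simulated stand-ins for a confidential and a synthetic variable (assumed data)
conf  <- data.frame(mass = rgamma(500, shape = 4, rate = 0.05))
synth <- data.frame(mass = rgamma(500, shape = 4, rate = 0.05))

# First four moments, selected percentiles, and the number of zeros
summarize_numeric <- function(x) {
  # Standardize with the population standard deviation for skewness/kurtosis
  z <- (x - mean(x)) / sqrt(mean((x - mean(x))^2))
  c(mean = mean(x),
    sd = sd(x),
    skewness = mean(z^3),
    kurtosis = mean(z^4),
    quantile(x, probs = c(0.1, 0.5, 0.9)),
    n_zero = sum(x == 0))
}

# Side-by-side comparison with the synthetic-minus-confidential difference
comparison <- cbind(confidential = summarize_numeric(conf$mass),
                    synthetic = summarize_numeric(synth$mass))
comparison <- cbind(comparison,
                    difference = comparison[, "synthetic"] - comparison[, "confidential"])
round(comparison, 3)
```

Small differences across every row suggest the univariate distribution was synthesized well; a large difference in one row (for example, the number of zeros) points to the specific feature the synthesizer missed.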

Bivariate

Correlation Fit: Measures how well the synthesizer recreates the linear relationships between variables in the confidential dataset.

  • Create correlation matrices for the synthetic data and confidential data. Then measure differences across synthetic and actual data. Those differences are often summarized across all variables using L1 or L2 distance.

  • Advanced measures like relative mutual information can be used to measure the relationships between categorical variables.

Multivariate

Discriminant based methods: Can a model distinguish (i.e. discriminate) between records from the confidential vs synthetic data?

  • The confidential data and synthetic data should theoretically be drawn from the same super population.

  • Basic idea is to take the confidential data and combine with the synthetic data, and see if a model can distinguish (i.e., discriminate) between the two.

  • If the data synthesis process is good, then hopefully a model won’t be able to distinguish between the two.

  • As a visual example of how these discriminant based methods work, imagine that we generated a really good synthetic dataset that closely aligned with the confidential data. These are what the general discriminant based utility metrics would look like.

  • And if we generated a pretty poor synthetic dataset, these are what the general discriminant based utility metrics would look like:

  • For all the discriminant based methods below, we generate propensity scores (i.e. the probability that a particular data point belongs to the synthetic data) using a classifier model.

  • The first few steps for all the specific methods outlined below are the same:

    1. Combine the synthetic and confidential data. Add an indicator variable with 0 for the confidential data and 1 for the synthetic data

      species bill_length_mm sex ind
      Chinstrap 49.5 male 0
      ... ... ... ...
      Adelie 46.0 male 1


    2. Calculate propensity scores (i.e. probabilities for group membership) for whether a given row belongs to the synthetic dataset, typically with a classifier like logistic regression or CART.

      species bill_length_mm sex ind prop_score
      Chinstrap 49.5 male 0 0.32
      ... ... ... ... ...
      Adelie 46.0 male 1 0.64


  • These propensity scores can be used to calculate various metrics for general utility, some of which are described below:

  • pMSE: The mean squared error (MSE) between the propensity scores and the expected probabilities.

  • Proposed by Woo et al. (2009) and enhanced by Snoke et al. (2018a).

  • After doing steps 1) and 2) above:

    3. Calculate the expected probability, i.e. the share of synthetic data in the combined data. In the cases where the synthetic and confidential datasets are of equal size, this will always be 0.5.

      species bill_length_mm sex ind prop_score exp_prob
      Chinstrap 49.5 male 0 0.32 0.5
      ... ... ... ... ... ...
      Adelie 46.0 male 1 0.64 0.5


    4. Calculate pMSE, which is the mean squared difference between the propensity scores and expected probabilities.

    \[pMSE = \frac{(0.32 - 0.5)^2 + ... + (0.64-0.5)^2}{N} \]

  • Often people use the pMSE ratio, which is the observed pMSE divided by its expected value under the null model (Snoke et al. 2018b).

  • The null model is the expected value of the pMSE in the best-case scenario, when the model used to generate the synthetic data reflects the confidential data perfectly.

  • pMSE ratio = 1 means that your synthetic data and confidential data are indistinguishable, although values this low are almost never achieved.
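The four pMSE steps and the ratio can be sketched as follows. The data are simulated stand-ins, and the closed-form null expectation \((k - 1)(1 - c)^2 c / N\) is the result I understand Snoke et al. (2018) to give for a logistic regression with \(k\) coefficients (including the intercept) and synthetic share \(c\); treat that formula as an assumption and check it against the paper before relying on it.

```r
set.seed(20230707)
n <- 500

# Simulated stand-ins for the confidential and synthetic data (assumed data)
conf <- data.frame(height = rnorm(n, 170, 10))
conf$mass <- 0.9 * conf$height - 80 + rnorm(n, 0, 8)
synth <- data.frame(height = rnorm(n, 170, 10))
synth$mass <- 0.9 * synth$height - 80 + rnorm(n, 0, 8)

# Step 1: combine and add the indicator (0 = confidential, 1 = synthetic)
combined <- rbind(cbind(conf, ind = 0), cbind(synth, ind = 1))

# Step 2: propensity scores from a logistic regression classifier
prop_model <- glm(ind ~ height + mass, data = combined, family = binomial())
combined$prop_score <- predict(prop_model, type = "response")

# Step 3: expected probability = share of synthetic records
exp_prob <- mean(combined$ind)

# Step 4: pMSE = mean squared difference between scores and expected probability
pmse <- mean((combined$prop_score - exp_prob)^2)

# pMSE ratio: observed pMSE over its expected value under the null
# (assumed closed form for logistic regression, see the lead-in)
k <- length(coef(prop_model))
N <- nrow(combined)
pmse_null <- (k - 1) * (1 - exp_prob)^2 * exp_prob / N
pmse_ratio <- pmse / pmse_null

c(pmse = pmse, pmse_null = pmse_null, pmse_ratio = pmse_ratio)
```

For classifiers without a closed-form null, such as CART, the null expectation is usually approximated by resampling instead.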






  • SPECKS: Synthetic data generation; Propensity score matching; Empirical Comparison via the Kolmogorov-Smirnov distance. After generating propensity scores (i.e. steps 1 and 2 from above), you:

    1. Calculate the empirical CDFs of the propensity scores for the synthetic and confidential data, separately.

    2. Calculate the Kolmogorov-Smirnov (KS) distance between the two empirical CDFs. The KS distance is the maximum vertical distance between the two empirical CDFs.
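A minimal base R sketch of SPECKS, under the assumption that propensity scores come from a logistic regression. The data are simulated, and the synthetic data are deliberately shifted so that the distance is visibly above zero.

```r
set.seed(20230707)
n <- 500

# Simulated stand-ins; the synthetic data are shifted on purpose (assumed data)
conf  <- data.frame(x = rnorm(n), ind = 0)
synth <- data.frame(x = rnorm(n, mean = 0.3), ind = 1)
combined <- rbind(conf, synth)

# Propensity scores (steps 1 and 2)
prop_model <- glm(ind ~ x, data = combined, family = binomial())
combined$prop_score <- predict(prop_model, type = "response")

# Empirical CDFs of the scores, separately by source
ecdf_conf  <- ecdf(combined$prop_score[combined$ind == 0])
ecdf_synth <- ecdf(combined$prop_score[combined$ind == 1])

# Both ECDFs are step functions that only change at observed scores,
# so the maximum vertical gap is attained at one of the observed scores
specks <- max(abs(ecdf_conf(combined$prop_score) - ecdf_synth(combined$prop_score)))
specks
```

The same statistic is returned by `ks.test()` in the `stats` package; values near 0 indicate the classifier cannot separate the two sources, and values near 1 indicate near-perfect separation.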






  • ROC curves: After generating propensity scores (i.e. steps 1 and 2 from above), you can create the ROC (Receiver Operating Characteristic curve) for the classifier and use that to evaluate how well your synthetic data mimics the confidential data.

  • AUC: Area under the ROC curve, a summary of how good your discriminator is.

  • In our context, high AUC = good at discriminating = poor synthesis.

  • In the best case, AUC = 0.5, because that means the discriminator is no better than a random guess.

  • It is useful to look at variable importance for predictive models when observing poor discriminant based metrics. Variable importance can help diagnose which variables are poorly synthesized.
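The AUC of the discriminator can be computed from the propensity scores without extra packages by using the rank (Mann-Whitney) identity. This is a sketch on simulated data; `auc()` is a hypothetical helper defined here, not a function from the course materials.

```r
set.seed(20230707)
n <- 500

# Simulated stand-ins; the synthetic data are shifted on purpose (assumed data)
conf  <- data.frame(x = rnorm(n), ind = 0)
synth <- data.frame(x = rnorm(n, mean = 0.3), ind = 1)
combined <- rbind(conf, synth)

prop_model <- glm(ind ~ x, data = combined, family = binomial())
combined$prop_score <- predict(prop_model, type = "response")

# AUC = probability that a random synthetic record gets a higher score
# than a random confidential record (ties count as one half)
auc <- function(score, label) {
  r <- rank(score)
  n1 <- sum(label == 1)
  n0 <- sum(label == 0)
  (sum(r[label == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

auc_value <- auc(combined$prop_score, combined$ind)
auc_value
```

Because the model is evaluated on the same records it was fit to, this in-sample AUC is optimistic for flexible classifiers; scoring held-out rows is a common refinement.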











Exercise 1: Calculating Utility Metrics

Assume that you have a confidential dataset of the starwars data, which is named conf_data below. You have already synthesized a fully synthetic dataset, named synth_data based on the confidential data. The conf_data looks like:

gender height mass
masculine 172 77
masculine 167 75
... ... ...


And synth_data looks like:

gender height mass
masculine 163.2612 99.68595
masculine 150.4994 92.96685
... ... ...


Question 1: Calculate the correlation fit between the synthetic and confidential data. Fill in the blanks and run the code below.

Code
# Fill in the blanks below:

# The cor() function can take in a dataframe of numeric columns and compute
# correlations between all columns, returning a correlation matrix.
# (Drop non-numeric columns such as gender before calling cor().)
conf_data_correlations = cor(###)
synth_data_correlations = cor(###)

correlation_differences = conf_data_correlations - synth_data_correlations

# Correlation fit here is the sum of sqrt(difference^2) over the difference matrix,
# i.e. the sum of the absolute differences (an L1 distance).
cor_fit = sum(sqrt( ### ^2))

cor_fit

Question 2: Compare the univariate distributions for mass and height in the confidential and synthetic data using density plots. Fill in the blanks and run the code below.

Code
combined_data = bind_rows("synthetic" = synth_data, 
                          "confidential" = conf_data,
                          .id = "type")

# Create a density plot of the mass distributions
combined_data |> 
  ggplot(aes(x = ###,
             fill = type)) +
  # color and alpha are layer settings, so they belong in geom_density()
  geom_density(alpha = 0.4, color = "white")

# Create a density plot of the height distributions
combined_data |> 
  ggplot(aes(x = ###,
             fill = type)) +
  geom_density(alpha = 0.4, color = "white")


Specific Utility Metrics

  • Specific utility metrics measure how suitable a synthetic dataset is for specific analyses.

  • These specific utility metrics will change from dataset to dataset, depending on what you’re using the data for.

  • A helpful rule of thumb: general utility metrics are useful for the data synthesizers to be convinced that they’re doing a good job. Specific utility metrics are useful to convince downstream data users that the data synthesizers are doing a good job.

  • Some examples of specific utility metrics, though again these will vary dramatically, are below.

Recreating Inferences

  • It can be useful to compare statistical analyses on the confidential data and synthetic data:
    • Do the estimates have the same sign?
    • Do the estimates have the same statistical significance at a common \(\alpha\) level?
    • Do the confidence intervals for the estimates overlap?
  • Each of these questions is useful. Barrientos et al. (2021) combine all three questions into sign, significance, and overlap (SSO) match.
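The three questions above can be checked per coefficient as in the sketch below. The data are simulated, and combining the three checks with a logical AND is my reading of SSO match rather than a definition taken from Barrientos et al. (2021).

```r
set.seed(20230707)
n <- 500
alpha <- 0.05

# Simulated stand-ins for the confidential and synthetic data (assumed data)
conf <- data.frame(x = rnorm(n))
conf$y <- 1 + 2 * conf$x + rnorm(n)
synth <- data.frame(x = rnorm(n))
synth$y <- 1 + 2 * synth$x + rnorm(n)

fit_conf  <- lm(y ~ x, data = conf)
fit_synth <- lm(y ~ x, data = synth)

est_conf  <- coef(summary(fit_conf))
est_synth <- coef(summary(fit_synth))
ci_conf   <- confint(fit_conf, level = 1 - alpha)
ci_synth  <- confint(fit_synth, level = 1 - alpha)

sso <- data.frame(
  term = rownames(est_conf),
  # Same sign of the point estimates
  same_sign = sign(est_conf[, "Estimate"]) == sign(est_synth[, "Estimate"]),
  # Same significance decision at the common alpha level
  same_significance = (est_conf[, "Pr(>|t|)"] < alpha) == (est_synth[, "Pr(>|t|)"] < alpha),
  # Confidence intervals share at least one point
  ci_overlap = ci_conf[, 1] <= ci_synth[, 2] & ci_synth[, 1] <= ci_conf[, 2],
  row.names = NULL
)
sso$sso_match <- sso$same_sign & sso$same_significance & sso$ci_overlap
sso
```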

Regression confidence interval overlap:

  • Measure of the overlap between confidence intervals for each coefficient in a linear regression model or logistic regression model estimated on the original data and a model estimated on the synthetic data.

  • Note that regression confidence interval overlap can be 0 or even negative when the intervals don’t overlap at all.

  • The value of confidence interval overlap diminishes when disclosure control methods generate very wide confidence intervals.
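One commonly used definition averages the overlapping length relative to each interval's width. The sketch below assumes that definition (it is not spelled out in these notes); `ci_overlap()` is a hypothetical helper, and the regression data are simulated.

```r
# Interval overlap: the shared length divided by each interval's width, averaged.
# Equals 1 for identical intervals and goes negative when the intervals are disjoint.
ci_overlap <- function(l_conf, u_conf, l_synth, u_synth) {
  shared <- pmin(u_conf, u_synth) - pmax(l_conf, l_synth)
  0.5 * (shared / (u_conf - l_conf) + shared / (u_synth - l_synth))
}

ci_overlap(0, 2, 0, 2)  # 1: identical intervals
ci_overlap(0, 2, 1, 3)  # 0.5: half of each interval is shared
ci_overlap(0, 1, 2, 3)  # -1: disjoint intervals

# Applied to every coefficient of a regression (simulated, assumed data)
set.seed(20230707)
n <- 500
conf <- data.frame(x = rnorm(n))
conf$y <- 1 + 2 * conf$x + rnorm(n)
synth <- data.frame(x = rnorm(n))
synth$y <- 1 + 2 * synth$x + rnorm(n)

ci_conf  <- confint(lm(y ~ x, data = conf))
ci_synth <- confint(lm(y ~ x, data = synth))

overlap_by_term <- ci_overlap(ci_conf[, 1], ci_conf[, 2], ci_synth[, 1], ci_synth[, 2])
overlap_by_term
```

Because the shared length is divided by the interval widths, very wide intervals can produce high overlap even when the point estimates differ substantially.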


Microsimulation results

  • The Urban Institute and Tax Policy Center are heavy users of microsimulation.

  • When synthesizing administrative tax data, we compare microsimulation results from tax calculators applied to the confidential data and synthetic data.


Disclosure Risk Metrics

  • How do we evaluate how well the synthetic data controls disclosure risks? That’s where disclosure metrics come in.

Identity Disclosure Metrics

  • Big picture: How often can we correctly re-identify confidential records from synthetic records for partially synthetic data?

  • For fully synthetic datasets, there is no one-to-one relationship between individuals and records, so identity disclosure risk is ill-defined. Generally, identity disclosure risk applies to partially synthetic datasets (or datasets protected with traditional SDC methods).

  • Most of these metrics rely on data maintainers essentially performing attacks against their synthetic data and seeing how successful they are at identifying individuals.

Basic matching approaches

  • We start by making assumptions about the knowledge an attacker has (i.e. external publicly accessible data they have access to).

  • For each confidential record, the data attacker identifies a set of partially synthetic records which they believe contain the target record (i.e. potential matches) using the external variables as matching criteria.

  • There are distance-based and probability-based algorithms that can perform this matching. This matching process could be based on exact matches between variables or some relaxations (i.e. matching continuous variables within a certain radius of the target record, or matching adjacent categorical variables).

  • We then evaluate how accurate our re-identification process was using a variety of metrics.

As a simple example for the metrics we’re about to cover, imagine a data attacker has access to the following external data:

homeworld species name
Naboo Gungan Jar Jar Binks
Naboo Droid R2-D2


And imagine that the partially synthetic released data looks like this:

homeworld species skin_color
Tatooine Human fair
Tatooine Droid gold
Naboo Droid white, blue
Tatooine Human white
Alderaan Human light
Tatooine Human light


Note that the released partially synthetic data does not have names. But using some basic matching rules in combination with the external data, an attacker is able to identify the following potential matches for Jar Jar Binks and R2-D2, two characters in the Star Wars universe:

homeworld species skin_color
Potential Jar Jar matches
Naboo Gungan orange
Naboo Gungan grey
Naboo Gungan green
Potential R2-D2 Matches
Naboo Droid white, blue


And since we are the data maintainers, we can take a look at the confidential data and know that the highlighted rows are “true” matches.

homeworld species skin_color
Potential Jar Jar matches
Naboo Gungan orange
Naboo Gungan grey
Naboo Gungan green
Potential R2-D2 Matches
Naboo Droid white, blue


These matches above are counted in various ways to evaluate identity disclosure risk. Below are some of those specific metrics. Generally for a good synthesis, we want a low expected match rate and true match rate, and a high false match rate.


  • Expected Match Rate: On average, how likely is it to find a “correct” match among all potential matches? Essentially, the number of observations in the confidential data that an intruder is expected to match correctly.

    • Higher expected match rate = higher identification disclosure risk.

    • The two other risk metrics below focus on the subset of confidential records for which the intruder identifies a single match.

    • In our example, this is \(\frac{1}{3} + 1 = 1.333\)



  • True Match Rate: The proportion of true unique matches among all confidential records. Higher true match rate = higher identification disclosure risk.

  • Assuming there are 100 rows in the confidential data in our example, this is \(\frac{1}{100} = 1\%\)





  • False Match Rate: The proportion of false matches among the set of unique matches. Lower false match rate = higher identification disclosure risk.

  • In our example, this is \(\frac{0}{1} = 0\%\)
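The three metrics can be reproduced for the running example with a few lines of R. The `matches` data frame is a hypothetical encoding introduced here: one row per targeted confidential record, with the size of its potential match set and whether the true record is in that set.

```r
# One row per confidential record the attacker targets (values from the example above)
matches <- data.frame(
  target = c("Jar Jar Binks", "R2-D2"),
  n_matches = c(3, 1),          # number of potential matches found
  true_in_set = c(TRUE, TRUE)   # is the true record among the potential matches?
)
n_conf <- 100  # assumed number of rows in the confidential data, as in the example

# Expected match rate: each target contributes 1 / n_matches if its true record is in the set
expected_match_rate <- sum(matches$true_in_set / matches$n_matches)

# Targets for which the attacker finds exactly one potential match
unique_match <- matches$n_matches == 1

# True match rate: correct unique matches among all confidential records
true_match_rate <- sum(unique_match & matches$true_in_set) / n_conf

# False match rate: incorrect unique matches among all unique matches
false_match_rate <- sum(unique_match & !matches$true_in_set) / sum(unique_match)

c(expected = expected_match_rate, true = true_match_rate, false = false_match_rate)
```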





Attribute Disclosure risk metrics

  • We were able to learn about Jar Jar and R2-D2 by re-identifying them in the data. It is possible to learn confidential attributes without perfectly re-identifying observations in the data.

Predictive Accuracy

  • Big picture: How well can we predict a sensitive attribute in a data set using the synthetic data (and external data)?

  • Similar to above, you start by matching synthetic records to confidential records. Alternatively, you can build a predictive model using the synthetic data to make predictions on the confidential data.

  • key variables: Variables that an attacker already knows about a record and can use to match.

  • target variables: Variables that an attacker wishes to know more or infer about using the synthetic data.

  • Pick a sensitive variable in the confidential data and use the synthetic data to make predictions. Evaluate the accuracy of the predictions.
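The model-based version of this attack can be sketched as below. Everything here is simulated and hypothetical: `key1` and `key2` play the role of key variables, `target` is the sensitive variable, and a second draw from the same process stands in for the output of a synthesizer.

```r
set.seed(20230707)
n <- 1000

# Hypothetical data-generating process with two key variables and a binary target
simulate <- function(n) {
  key1 <- rnorm(n)
  key2 <- rbinom(n, 1, 0.5)
  target <- rbinom(n, 1, plogis(-0.5 + 1.5 * key1 + key2))
  data.frame(key1, key2, target)
}

conf  <- simulate(n)
synth <- simulate(n)  # stand-in for synthesizer output (assumed)

# The attacker trains on the released synthetic data only
attack_model <- glm(target ~ key1 + key2, data = synth, family = binomial())

# ...and predicts the sensitive variable for confidential records from their key variables
predicted <- as.integer(predict(attack_model, newdata = conf, type = "response") > 0.5)

# Data maintainers can score the attack because they hold the true values
accuracy <- mean(predicted == conf$target)

# Baseline: always guess the most common value
baseline <- max(mean(conf$target), 1 - mean(conf$target))

c(accuracy = accuracy, baseline = baseline)
```

Comparing against the majority-class baseline separates what the synthetic data reveals about individuals from what anyone could guess from the overall distribution.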



Membership Inference Tests

  • Membership Inference Tests: Can we perform a membership attack to determine if a particular record is in the confidential data?

    • Why is this important? Sometimes membership in the underlying dataset is itself confidential (e.g. a dataset of HIV positive patients or people who have experienced homelessness).

    • Also particularly useful for fully synthetic data where identity disclosure and attribute disclosure metrics don’t really make a lot of sense.

    • Assumes that attacker has access to a subset of the confidential data, and wants to tell if one or more records was used to generate the synthetic data.

    • Since we as data maintainers know the true answers, we can evaluate whether the attacker’s guess is true and can break it down many ways (e.g. true positives, true negatives, false positives or false negatives).

      source for figure: Mendelevitch and Lesh (2021)

    • The “close enough” threshold is usually determined by a custom distance metric, like edit distance between text variables or numeric distance between continuous variables.

    • Often you will want to choose different distance thresholds and evaluate how your results change.
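A distance-threshold membership attack can be sketched as follows. The setup is an assumption made for illustration: the "synthesizer" is a deliberately leaky stand-in that returns the training records plus a little noise, Euclidean distance is the distance metric, and `nearest_dist()` and `evaluate()` are hypothetical helpers.

```r
set.seed(20230707)

train   <- matrix(rnorm(200 * 2), ncol = 2)  # records used to generate the synthetic data
holdout <- matrix(rnorm(200 * 2), ncol = 2)  # same population, never seen by the synthesizer

# Leaky stand-in synthesizer: training records plus small noise (assumed)
synth <- train + matrix(rnorm(200 * 2, sd = 0.05), ncol = 2)

# Euclidean distance from each target record to its nearest reference record
nearest_dist <- function(targets, reference) {
  apply(targets, 1, function(r) min(sqrt(colSums((t(reference) - r)^2))))
}

# The attacker holds a mix of members and non-members
attacker_records <- rbind(train, holdout)
is_member <- rep(c(TRUE, FALSE), each = 200)
d <- nearest_dist(attacker_records, synth)

# Guess "member" when a synthetic record is close enough, then score the guesses
evaluate <- function(threshold) {
  guess <- d < threshold
  c(threshold = threshold,
    precision = sum(guess & is_member) / sum(guess),
    recall = sum(guess & is_member) / sum(is_member))
}

results <- t(sapply(c(0.05, 0.1, 0.2), evaluate))
results
```

Looser thresholds can only raise recall, since every record flagged at a tight threshold is also flagged at a looser one, while precision typically falls as more non-members are flagged.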

Copy Protection

  • Copy Protection Metrics: Is our synthesizer simply memorizing the confidential data? i.e. Are our models too good?

    • Distance to Closest record: Measures distance between each real record (\(r\)) and the closest synthetic record (\(s_i\)), as determined by a distance calculation.

      • Many distance metrics are used in the literature, including Euclidean distance, cosine distance, Gower distance, and Hamming distance (Mendelevitch and Lesh 2021).

      • The goal of this metric is to easily expose exact copies or simple perturbations of the real records that exist in the synthetic dataset.

      • Note that DCR = 0 doesn’t necessarily mean high disclosure risk, because in some datasets the “space” spanned by the variables in scope is relatively small.
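A minimal sketch of distance to closest record with Euclidean distance on numeric variables. The data are simulated, `dcr()` is a hypothetical helper, and 30 exact copies are planted in one synthetic file to show what the metric exposes.

```r
set.seed(20230707)

conf <- matrix(rnorm(300 * 2), ncol = 2)      # stand-in for the real records (assumed)
synth_ok <- matrix(rnorm(300 * 2), ncol = 2)  # independent draws, no copying

# A second synthetic file that memorized 30 real records
synth_copy <- synth_ok
synth_copy[1:30, ] <- conf[1:30, ]

# For each real record, the Euclidean distance to the closest synthetic record
dcr <- function(real, synthetic) {
  apply(real, 1, function(r) min(sqrt(colSums((t(synthetic) - r)^2))))
}

dcr_ok   <- dcr(conf, synth_ok)
dcr_copy <- dcr(conf, synth_copy)

sum(dcr_ok == 0)    # 0: no exact copies
sum(dcr_copy == 0)  # 30: the planted copies
summary(dcr_copy)
```

With mixed numeric and categorical variables, a metric such as Gower distance would replace the Euclidean calculation inside `dcr()`.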



Hold Out Data

Membership inference tests and copy protection metrics are informative but lack context. When possible, create a holdout data set similar to the training data. Then calculate the membership inference tests and copy protection metrics with the holdout data in place of the synthetic data. The results serve as a benchmark for the original membership inference tests and copy protection metrics.

Exercise 2: Disclosure Metrics

Following the same example with Jar Jar Binks above, let’s assume that using external data an attacker was able to identify these 3 potential matches for Jar Jar Binks in the data. And because we have access to the confidential data, we know that the row in pink is a “correct” match.

homeworld species skin_color
Naboo Gungan green
Naboo Gungan green
Naboo Gungan grey


Question 1: If an attacker randomly chooses one of these matches to be Jar Jar Binks, what is the probability they will be right?

Question 2: Assume that previously an attacker did not know the skin_color of Jar Jar Binks. Using this list of matches, what approaches could an attacker take to guess the skin_color of Jar Jar Binks? What is the accuracy of each of those approaches?


Case Studies

Fully Synthetic PUF for IRS Non-Filers (Bowen et al. 2020)

  • Data: A 2012 file of “non-filers” created by the IRS Statistics of Income Division.
  • Motivation: Non-filer information is important for modeling certain tax reforms and this was a proof-of-concept for a more complex file.
  • Methods: Sequential CART models with additional noise added based on the sparsity of nearby observations in the confidential distribution.
  • Important metrics:
    • General utility: Proportions of non-zero values, first four moments, correlation fit
    • Specific utility: Tax microsimulation, regression confidence interval overlap
    • Disclosure: Node heterogeneity in the CART model, rates of recreating observations
  • Lessons learned:
    • Synthetic data can work well for tax microsimulation.
    • It is difficult to match certain utility metrics for sparse variables.

Suggested Reading

Snoke, Joshua, Gillian M Raab, Beata Nowok, Chris Dibben, and Aleksandra Slavkovic. 2018b. “General and Specific Utility Measures for Synthetic Data.” Journal of the Royal Statistical Society: Series A (Statistics in Society) 181 (3): 663–88.

Bowen, Claire McKay, Victoria Bryant, Leonard Burman, Surachai Khitatrakun, Robert McClelland, Philip Stallworth, Kyle Ueyama, and Aaron R Williams. 2020. “A Synthetic Supplemental Public Use File of Low-Income Information Return Data: Methodology, Utility, and Privacy Implications.” In International Conference on Privacy in Statistical Databases, 257–70. Springer.

References

Barrientos, Andrés F., Aaron R. Williams, Joshua Snoke, and Claire McKay Bowen. 2021. “A Feasibility Study of Differentially Private Summary Statistics and Regression Analyses with Evaluations on Administrative and Survey Data.” https://doi.org/10.48550/ARXIV.2110.12055.
Bowen, Claire McKay, Victoria Bryant, Leonard Burman, Surachai Khitatrakun, Robert McClelland, Philip Stallworth, Kyle Ueyama, and Aaron R Williams. 2020. “A Synthetic Supplemental Public Use File of Low-Income Information Return Data: Methodology, Utility, and Privacy Implications.” In International Conference on Privacy in Statistical Databases, 257–70. Springer.
Mendelevitch, Ofer, and Michael D Lesh. 2021. “Fidelity and Privacy of Synthetic Medical Data.” arXiv Preprint arXiv:2101.08658.
Snoke, Joshua, Gillian M. Raab, Beata Nowok, Chris Dibben, and Aleksandra Slavkovic. 2018a. “General and Specific Utility Measures for Synthetic Data.” Journal of the Royal Statistical Society: Series A (Statistics in Society) 181 (3): 663–88. https://doi.org/10.1111/rssa.12358.
Snoke, Joshua, Gillian M Raab, Beata Nowok, Chris Dibben, and Aleksandra Slavkovic. 2018b. “General and Specific Utility Measures for Synthetic Data.” Journal of the Royal Statistical Society: Series A (Statistics in Society) 181 (3): 663–88.
Woo, Mi-Ja, Jerome P Reiter, Anna Oganian, and Alan F Karr. 2009. “Global Measures of Data Utility for Microdata Masked for Disclosure Limitation.” Journal of Privacy and Confidentiality 1 (1).